sincFold: end-to-end learning of short- and long-range interactions in RNA secondary structure

Abstract
Motivation: Coding and noncoding RNA molecules participate in many important biological processes. Noncoding RNAs fold into well-defined secondary structures to exert their functions. However, the computational prediction of the secondary structure from a raw RNA sequence is a long-standing unsolved problem, which after decades of almost unchanged performance has now re-emerged due to deep learning. Traditional RNA secondary structure prediction algorithms have been mostly based on thermodynamic models and dynamic programming for free energy minimization. More recently, deep learning methods have shown competitive performance compared with the classical ones, but there is still a wide margin for improvement.
Results: In this work we present sincFold, an end-to-end deep learning approach that predicts the nucleotide contact matrix using only the RNA sequence as input. The model is based on 1D and 2D residual neural networks that can learn short- and long-range interaction patterns. We show that structures can be accurately predicted with minimal physical assumptions. Extensive experiments were conducted on several benchmark datasets, considering sequence homology and cross-family validation. sincFold was compared with classical methods and recent deep learning models, showing that it can outperform the state-of-the-art methods.


Introduction
Noncoding ribonucleic acid (ncRNA) molecules have emerged as crucial players in cellular processes, encompassing epigenetics, transcriptional and post-transcriptional regulation, chromosome replication, translation and protein activity and stability [1,2]. Recent efforts have even explored the clinical potential of ncRNA in diagnostics, vaccines and therapies [3]. This paradigm shift in our understanding of ncRNA, from being dismissed as 'transcriptional noise' prior to the 1980s to being recognized as regulators of gene expression at multiple levels, has generated an explosion of research in this field over the past few decades [4].
RNA itself consists of an ordered sequence of four basic nucleotides: adenine (A), cytosine (C), guanine (G) and uracil (U). Pairing these bases within an RNA molecule gives rise to its secondary structure, a crucial determinant of its functions and stability [5]. In coding RNAs, the need to maintain the reading frame during translation acts as a constraint, leading to specific structural features that optimize protein synthesis. In ncRNAs, by contrast, selection acting on structure accelerates their rate of evolution, which makes secondary structure prediction more challenging. The secondary structure is characterized by hydrogen bonding interactions between complementary base pairs, which typically include the canonical Watson-Crick-Franklin pairs A-U and C-G [6], along with the wobble pair G-U [7]. Non-canonical interactions such as A-C or G-A can also occur, and they remain a major challenge in predicting RNA secondary structure [8]. Basic stem-loop structures, formed by nested base pairs, are commonly observed. However, the secondary structure can also exhibit complex motifs arising from local bonding and long-range sequence interactions, as well as other challenging features such as pseudoknots [9], multi-chain structures [10] and multiple connections per nucleotide (multiplets) [11].
Despite the growing number of publicly accessible ncRNA sequences, a significant proportion of their true structures remains unknown [12]. Secondary structures can be obtained with sophisticated experimental techniques such as X-ray crystallography, nuclear magnetic resonance [13-15], enzymatic probing methods such as nextPARS [16] or chemical probing such as DMS-seq [17] and SHAPE-seq [18]. However, all these methods suffer from low resolution and high costs [19]. Consequently, due to its cost-effectiveness, computational prediction has gained substantial relevance in biological research and biotechnological applications.
To surpass this performance ceiling, machine learning (ML) techniques, and particularly deep learning (DL), have emerged as promising alternatives [31]. DL techniques have gained wide attention for protein structure prediction with AlphaFold [32], and more recently several methods have been presented for RNA secondary structure prediction [29,33-36]. However, the available RNA datasets are very small compared with those for proteins, they are highly biased in several ways, and pseudoknots, a key factor in RNA structures, are not consistently annotated. In [37] the authors state that there are several possible ways to enable the accurate prediction of RNA structures in the near future, such as improving knowledge through more data, diversifying the data used in prediction and improving the ML methods used. In particular, DL methods rely less on assumptions about the thermodynamic mechanics of folding, instead adopting a data-driven approach. Consequently, they could be better suited to identify complex structures that defy modeling using traditional techniques. However, recent systematic evaluations comparing the performance of these techniques on ncRNAs showed that DL has not yet clearly outperformed classical methods [8,30,38].
Currently, several DL approaches are available with different architectural designs, input representations, training data and optimization algorithms for parameter adjustments [39]. Among these proposals, SPOT-RNA [34] was the pioneering DL method based on ensembles of convolutional neural networks (CNNs) and bidirectional long short-term memory neural networks (LSTMs). SPOT-RNA2 [40] improved on its predecessor using predictions from thermodynamic models, evolution-derived sequence profiles and mutational coupling, at the cost of requiring multiple sequence alignments. Another hybrid approach was MXfold [41], combining support vector machines and thermodynamic models. Similarly, DMFold [42] and MXfold2 [29] integrated DL techniques with energy-based methods. Another method based on both DL and dynamic programming was CDPfold [33], which iteratively computes a matrix representation of possible matchings between bases according to a physical model of base interactions, and then trains a convolutional network over this matrix to predict base pairing probabilities. Dynamic programming is then applied to these probabilities to obtain the final RNA secondary structure.
In more recent years, UFold [35] approached the secondary structure prediction problem using a well-known architecture from image segmentation, the U-Net encoder-decoder [43]. It uses a 2D feature map to encode the occurrences of one of the 16 possible base pairs between nucleotides for each position in the map, including an additional channel with the matrix representation of possible matchings iteratively computed with the algorithm proposed in CDPfold. The predicted output is the contact score map between the bases of the input sequence, which goes through a post-processing step that involves solving a linear programming problem to obtain the optimum contact map. Interestingly, a very recent method, REDfold [36], was reported to outperform UFold. This DL method also utilizes a U-Net encoder-decoder network to learn dependencies along the RNA sequence, together with symmetric skip connections to propagate activation information across layers and output post-processing with constrained optimization.
In this work we present sincFold, a novel end-to-end DL method for RNA secondary structure prediction for single-chain RNA sequences. Our approach is based on ResNet bottlenecks to capture both short- and long-range dependencies in the RNA sequence. Unlike other DL models, we adopt a two-stage encoding process: initially, we model sequence encoding in 1D, enabling the learning of small context features and reducing computational costs; then a pairwise encoding in 2D is incorporated to capture distant relationships. Extensive experimental evaluations on two widely used ncRNA databases demonstrate that sincFold outperforms classical methods and DL state-of-the-art techniques in terms of F1 performance. We have made the source code for sincFold freely accessible, facilitating its adoption and further development in the research community (source code available at https://github.com/sinc-lab/sincFold). Moreover, a web service to test the trained model is provided (web demo available at https://sinc.unl.edu.ar/web-demo/sincFold).

The sincFold model
In order to obtain a secondary structure prediction from a standalone RNA sequence, we propose sincFold. The current approach was designed for single-chain RNA. As shown in Fig. 1, this novel DL architecture is composed of two stages: the first one learns local patterns in 1D encodings, while the second stage can learn more distant interactions in 2D. The figure represents in detail the shapes and dimensions of the data along the pipeline (top) and the neural processing blocks (bottom).
The model takes as input an RNA sequence of length L encoded in one-hot (bottom-left) so that each nucleotide type of the sequence is represented with a vector of size 4 (i.e. a one-hot codification of the four canonical nucleotides). The encoded sequence goes through a one-dimensional convolutional layer that performs a first automatic extraction of low-level features for each nucleotide. Then, identity blocks [44] are stacked in a 1D-ResNet. These blocks allow the model to propagate the signal and reduce vanishing gradient issues while maintaining the same sequence length. Moreover, the identity blocks make the model capable of auto-defining the number of convolutional layers needed during training. Each block is composed of two batch normalization layers, ReLU activations and convolutional layers in 1D, with bottlenecks in the features (depicted with light green in the figure). Bottlenecks reduce the learnable parameters while helping to learn more relevant features.
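As an illustration of the input representation described above, the following minimal sketch encodes a sequence as L vectors of size 4. The channel order `ACGU`, the function name `one_hot` and the list-of-lists layout are assumptions made for this example and are not taken from the sincFold source code.

```python
# Channel order is an assumption made for this sketch.
NUCLEOTIDES = "ACGU"


def one_hot(sequence):
    """Return an L x 4 list of lists with one one-hot row per nucleotide."""
    rows = []
    for nt in sequence.upper():
        if nt not in NUCLEOTIDES:
            raise ValueError(f"unsupported nucleotide: {nt!r}")
        # Exactly one position is set to 1.0, the one matching the nucleotide.
        rows.append([1.0 if nt == ref else 0.0 for ref in NUCLEOTIDES])
    return rows


encoded = one_hot("GACU")
```

Each row sums to one, and the number of rows equals the sequence length L, so the encoding can be fed to a 1D convolution with four input channels.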
After the 1D bottlenecks, an M × L encoding is obtained, where M is the dimension of the feature vector of each nucleotide. Then two convolutions in 1D produce two compressed encodings of size E × L. The transpose of one E × L matrix is multiplied by the other E × L matrix, obtaining a first 'draft' of the contact matrix in 2D (L × L). After that, the matrix is forced to be symmetric by adding its transpose. An additional channel of interaction priors is added at this point, coding different bonding strengths for C-G, U-A and G-U.
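The 1D-to-2D conversion described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the numeric bonding strengths in `PAIR_PRIOR` are placeholder values chosen for the example, since the text does not specify them.

```python
# Placeholder bonding strengths (assumed values, not from the paper).
PAIR_PRIOR = {
    ("C", "G"): 3.0, ("G", "C"): 3.0,
    ("A", "U"): 2.0, ("U", "A"): 2.0,
    ("G", "U"): 0.8, ("U", "G"): 0.8,
}


def draft_contact_matrix(enc_a, enc_b):
    """Combine two E x L encodings into a symmetric L x L draft matrix."""
    n_features, length = len(enc_a), len(enc_a[0])
    # Matrix product: transpose of enc_a (L x E) times enc_b (E x L).
    draft = [[sum(enc_a[e][i] * enc_b[e][j] for e in range(n_features))
              for j in range(length)] for i in range(length)]
    # Force symmetry by adding the transpose.
    return [[draft[i][j] + draft[j][i] for j in range(length)]
            for i in range(length)]


def interaction_prior(sequence):
    """Return an L x L channel with a prior strength for each possible pair."""
    return [[PAIR_PRIOR.get((a, b), 0.0) for b in sequence] for a in sequence]
```

The symmetric draft and the prior channel have the same L × L shape, so they can be stacked as two input channels of the 2D stage.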
Once the information is represented in 2D, the new L × L tensor goes through a 2D-ResNet stage. Similarly to the 1D-ResNet stage, a 2D-convolutional layer is followed by 2D-ResNet blocks composed of batch normalization layers, ReLU activations and 2D convolutions. After several 2D-ResNet layers with bottlenecks the 2D pairwise encodings are flattened to an L × L output, and its transpose is added to force symmetry. This output matrix is the final 2D prediction of the secondary structure for the RNA sequence, and the entire model can be trained with a unified cost function. A simple post-processing is applied to find the maximum activation on each row and column, thus retaining only one interaction per nucleotide.
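One possible reading of the post-processing step is sketched below: a pair (i, j) is kept only when j is the maximum of row i and i is the maximum of row j, which for a symmetric matrix is the same as being the maximum of its row and of its column. The function name `postprocess` and the activation threshold of 0.5 are assumptions for this example.

```python
def postprocess(scores, threshold=0.5):
    """Keep at most one interaction per nucleotide from a symmetric L x L matrix."""
    n = len(scores)
    # Index of the maximum activation in each row (first index wins ties).
    row_best = [max(range(n), key=lambda j, i=i: scores[i][j]) for i in range(n)]
    contacts = [[0] * n for _ in range(n)]
    for i, j in enumerate(row_best):
        # Keep the pair only if the choice is mutual and strong enough.
        if i != j and row_best[j] == i and scores[i][j] > threshold:
            contacts[i][j] = 1
            contacts[j][i] = 1
    return contacts


example = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.8, 0.2],
    [0.2, 0.8, 0.0, 0.1],
    [0.9, 0.2, 0.1, 0.0],
]
binary = postprocess(example)
```

In this small example nucleotide 0 pairs with 3 and nucleotide 1 pairs with 2, so every row and column of the binary matrix contains at most a single contact.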
To guide training, we propose a composed loss function

$$\mathcal{L} = \mathcal{L}_\alpha + \lambda_\beta \mathcal{L}_\beta + \lambda_1 \mathcal{L}_1,$$

where $\mathcal{L}_\alpha$ is the cross-entropy loss of the final prediction, $\mathcal{L}_\beta$ is the cross-entropy loss of the model prediction prior to the 2D-ResNet block and $\mathcal{L}_1$ is an L1 loss of the predictions used to enforce the sparseness of the contact matrices. Cross-entropy is computed element by element in the matrix. The weights $\lambda_\beta$ and $\lambda_1$ of these terms are hyperparameters to be adjusted experimentally.
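The composed loss can be illustrated numerically as the cross-entropy of the final prediction plus the two weighted terms. The sketch below rests on several assumptions that are not stated in the text: predictions are probabilities in [0, 1], cross-entropy is the binary form averaged over all matrix elements, the L1 term is applied to the final prediction, and the default weights are arbitrary placeholders.

```python
import math


def cross_entropy(pred, target, eps=1e-7):
    """Mean element-wise binary cross-entropy between two L x L matrices."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def l1_term(pred):
    """Mean absolute activation, which favors sparse contact matrices."""
    values = [abs(p) for row in pred for p in row]
    return sum(values) / len(values)


def composed_loss(final_pred, draft_pred, target, lambda_beta=0.5, lambda_1=0.01):
    """Final cross-entropy plus weighted draft cross-entropy and L1 terms."""
    return (cross_entropy(final_pred, target)
            + lambda_beta * cross_entropy(draft_pred, target)
            + lambda_1 * l1_term(final_pred))
```

With a perfect prediction both cross-entropy terms vanish and only the small sparsity penalty remains, while an uninformative prediction of 0.5 everywhere yields a much larger value.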
Our proposed architecture is different from existing DL models in several ways. SPOT-RNA converts to a 2D representation but only as a pre-processing stage, by outer concatenation of the one-hot codification of the sequence. In this model, 1D patterns are not learned throughout the sequence. Then, the prediction of structure is obtained with an ensemble of ResNet blocks with dilated convolutions, a 2D-BLSTM (bidirectional long short-term memory) layer and a fully connected block. Furthermore, the SPOT-RNA source code for training is not available and thus it cannot be compared with other methods under the same conditions. MXfold2 has an architecture that models 1D and 2D representations through BLSTM and 2D convolution blocks, respectively. The conversion from 1D to 2D is based on a concatenation of halves of the 1D embeddings, so that different halves appear together in the corresponding coordinates of the L × L output. The choice of halves as a concatenation block only responds to the need to form a 2D representation, but has no basis in the modeling of the structural connections to be predicted. Moreover, as in SPOT-RNA pre-processing, these concatenations do not include any inner products that measure similarity between 1D representations. Finally, it is important to note that MXfold2 is actually a hybrid method, which does not predict a contact matrix but four types of folding scores for each pair of nucleotides. The folding scores are integrated with the free energy parameters of Turner's nearest-neighbor model. Then an optimal secondary structure is calculated using classical dynamic programming. Unlike UFold and REDfold, which use the standard U-Net originally proposed in computer vision for image segmentation and a post-processing step with linear programming, with sincFold we propose a novel full end-to-end architecture that models separately the 1D (short-range) and 2D (long-range) interactions. It is important to note that in both UFold and REDfold the conversion from 1D to 2D is, as in other models, a pre-processing stage (i.e. it is not part of the DL model). For example, in UFold pre-processing the one-hot codification of the RNA sequence is converted into a 16-channel 'image' via a Kronecker product. In contrast, sincFold learns representations from a 1D sequence, converts them to a 2D representation with a tensorial product and then learns long-range interactions through training.

Data
In the last decades, several RNA data collections appeared for benchmarking RNA folding methods [5,34,45-47], including experimentally determined RNA structures. In order to evaluate sincFold and compare its performance with other state-of-the-art methods, we have chosen the datasets most widely used in previous works and cited by the community. These datasets present different challenges:
RNAstralign dataset [47] is the largest dataset, with 37 149 sequences and experimentally verified structures from eight large RNA families: 5S rRNAs, Group I Intron, tmRNA, tRNA, 16S rRNA, Signal Recognition Particle (SRP) RNA, RNase P RNA and Telomerase RNA. It is one of the most comprehensive RNA structure datasets available. Minimum sequence length: 30 nt.
ArchiveII dataset [5] is the most widely used benchmark dataset for RNA folding methods; it is a manually curated dataset that includes a homology-aware standard split for the challenge of predicting the structure of RNA sequences belonging to families not seen in training. It contains RNA structures from nine RNA families: 5S rRNAs, SRP RNA, tRNA, tmRNA, RNase P RNA, Group I Intron, 16S rRNA, Telomerase RNA and 23S rRNA. The total number of sequences is 3975. Minimum sequence length: 28 nt.
TR0-TS0 dataset [34] is the same partition between train and test data that was proposed in SPOT-RNA. The sequences are from bpRNA 1.0 (Danaee et al., 2018). This dataset consists of a nonredundant set of RNA sequences with annotated secondary structure from bpRNA34 at 80% sequence-identity cutoff with CD-HIT-EST [48]. This filtered dataset of 13 419 RNAs provides homology-aware data splits of 10 814 sequences for training (TR0), 1300 for validation (VL0) and 1305 for an independent test (TS0). Minimum sequence length: 30 nt.
Ablation dataset: in addition, we compiled a dataset of sequences derived from the URS server [49] to be used as a small independent dataset for model optimization. For this reason, this dataset cannot be used in any of the tests comparing sincFold with other methods. These sequences and secondary structures were extracted from the Protein Data Bank, consisting of 753 sequences ranging from 8 to 456 nucleotides.
As suggested in [50], sequences longer than 512 nucleotides were filtered out to limit the runtime of experiments, leaving 22 611 sequences in the RNAstralign dataset and 3864 sequences in the ArchiveII dataset. Group I intron RNAs were excluded from the RNAstralign dataset because they included sequences without a unique structure. Thus, in this manuscript we will show results only for sequences shorter than 512 nt. Furthermore, it has to be mentioned that sincFold, by design, can predict the structure for any sequence length. However, since it was trained on the available datasets, which have a minimum sequence length, accurate predictions can be expected for sequences longer than that minimum length.
To assess the performance, all DL methods used in this study were re-trained from scratch with the exact same partitions for training and testing. First, we performed a k-fold cross-validation with k = 5 on the Ablation, ArchiveII and RNAstralign datasets. For the ArchiveII dataset, the original k-fold split provided by the authors was used [5]. Sequences were randomly divided into five independent folds of approximately the same size, and each fold was in turn taken as the test data while the remaining folds were taken as the training data. Then, we considered the structural differences between sequences used in training and testing, in order to analyze the impact of homology on performance. In the TR0-TS0 dataset, we used the provided homology-aware partitions with 80% sequence similarity cutoff. Finally, we performed a cross-family analysis (testing on unseen RNA families) using the ArchiveII dataset.
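The random k-fold protocol described above can be sketched with the standard library alone. The function name `k_fold_indices` and the fixed seed are assumptions for this illustration, and the sketch does not reproduce the original ArchiveII split provided by its authors.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a random k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Folds of approximately the same size: every k-th shuffled index.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging the per-fold scores covers the whole dataset once.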

Performance measures
The focus of performance measures is on the predicted base pairs in comparison with a reference structure [51]. Pairs that are both in the prediction and in the reference structure are true positives (TP), while pairs predicted but not in the true structure are false positives (FP). Similarly, a pair in the reference structure that is not predicted is a false negative (FN), and a pair that is neither predicted nor in the true structure is a true negative (TN). The performance of the methods is reported with the F1 score, defined in terms of recall or sensitivity ($s^{+} = TP/(TP+FN)$) and precision ($p = TP/(TP+FP)$) as follows:

$$F_1 = \frac{2\, s^{+}\, p}{s^{+} + p}.$$

The whole RNA structure can be considered as a large interaction network composed of interactions and base stackings [52]. The interaction network fidelity (INF) similarity measure [53] was designed to score the similarity between the interactions of a reference RNA structure and the interactions of a predicted RNA structure. INF is defined as

$$\mathrm{INF} = \sqrt{s^{+}\, p}.$$

In [54] it was demonstrated how two structures that share a common feature (for example, a hairpin) with the exact same base pair patterns can achieve F1 = 0. This happens when similar base-pair patterns in the two secondary structures are merely shifted with respect to each other, which is not reflected by the F1 score. Thus, the Weisfeiler-Lehman graph kernel (WL) metric was proposed in order to capture graph structural information by iteratively refining node labels based on their local neighborhoods. The WL metric first assigns to each node (nucleotide) in the graph (secondary structure) a label representing its local structural information. Then, a label propagation step iterates over the nodes and updates their labels based on the labels of their neighboring nodes. Finally, a hash function is computed that aggregates these labels to generate a feature vector. The WL similarity between two graphs is computed from the feature vector of each graph G_i, obtained by aggregating the labels through the hash functions. The WL-similarity score is sensitive to both structural and sequence-level alterations. This means that, for example, in the presence of a small shift in the prediction, the WL-similarity will provide a slightly different score. However, any classical score will be just zero because RNA secondary structures are typically evaluated with a strict comparison of predicted base pairs. Another advantage of WL is the inclusion of sequence information into structure evaluation. When there are mutation events at the sequence level, all other measures cannot capture the mutation information, whereas the WL-similarity decreases with the amount of sequence changes.
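The contrast between pair-based scores and a WL-style score can be sketched as follows. The pair metrics follow the standard definitions of recall, precision, F1 and INF. The WL part is only an illustrative approximation built on assumptions that are not taken from [54]: nodes are labeled with their nucleotide letter, edges are backbone neighbors plus base pairs, two refinement iterations are used, concatenated label strings play the role of the hash, and the similarity is the normalized inner product of the label-count vectors.

```python
import math
from collections import Counter


def pair_metrics(predicted, reference):
    """Recall, precision, F1 and INF from two collections of base pairs."""
    predicted = {tuple(sorted(p)) for p in predicted}
    reference = {tuple(sorted(p)) for p in reference}
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "inf": math.sqrt(recall * precision)}


def wl_features(sequence, pairs, iterations=2):
    """Count node labels over a few rounds of neighborhood label refinement."""
    adj = {i: set() for i in range(len(sequence))}
    for i in range(len(sequence) - 1):  # backbone edges
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    for i, j in pairs:  # base-pair edges (0-based positions)
        adj[i].add(j)
        adj[j].add(i)
    labels = {i: sequence[i] for i in adj}
    features = Counter(labels.values())
    for _ in range(iterations):
        # New label: own label plus the sorted labels of the neighbors.
        labels = {i: labels[i] + "(" + ",".join(sorted(labels[k] for k in adj[i])) + ")"
                  for i in adj}
        features.update(labels.values())
    return features


def wl_similarity(seq_a, pairs_a, seq_b, pairs_b, iterations=2):
    """Normalized inner product of the two label-count vectors."""
    fa = wl_features(seq_a, pairs_a, iterations)
    fb = wl_features(seq_b, pairs_b, iterations)
    dot = sum(fa[k] * fb[k] for k in fa)
    norm = math.sqrt(sum(v * v for v in fa.values()) * sum(v * v for v in fb.values()))
    return dot / norm if norm else 0.0


seq = "GGGGAAAACCCC"
reference_pairs = [(0, 11), (1, 10), (2, 9), (3, 8)]
shifted_pairs = [(0, 10), (1, 9), (2, 8)]  # same hairpin shifted by one position
shift_f1 = pair_metrics(shifted_pairs, reference_pairs)["f1"]
shift_wl = wl_similarity(seq, reference_pairs, seq, shifted_pairs)
```

For the shifted hairpin no base pair is shared with the reference, so the F1 score is zero, while the WL-style similarity stays strictly between zero and one because many local neighborhoods are still identical.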

Distance measure for secondary structures
It is known that minor changes in RNA sequences can represent significant changes in secondary structure and, conversely, very similar structures can be obtained from quite different sequences. Thus, for analyzing results, the structural distance between data samples is more representative of the prediction challenge than a simple sequence-level distance. For this reason, the structural distance was computed using RNAdistance from the ViennaRNA package [26,55]. This distance is based on the edit distance of a tree representation, in which the secondary structure is converted into a tree by assigning an internal node to each base pair and a leaf node to each unpaired digit [56]. Then, a tree is transformed into another tree by a series of editing operations with predefined costs. The distance between the two trees is the smallest sum of the costs along an editing path, which is divided by the length of the longest sequence in order to obtain a normalized distance.

Ablation study and hyperparameters exploration
We conducted an ablation study to gain a deeper understanding of the contribution of each of the components of the sincFold architecture. We ran several versions of the sincFold model: C1D) a baseline model with only 1D-convolutional networks; R1D) the same model replacing convolutions with 1D-residual blocks and bottlenecks; C1D+C2D) the model with the 2D stage using only convolutional neural networks; C1D+R2D) replacing convolutions with the 2D-residual blocks in the 2D stage; and R1D+R2D) with residual blocks and bottlenecks in both stages. The F1 scores for each ablated sincFold version, from a 5-fold cross-validation on the Ablation dataset, are shown in the boxplots of Fig. 2.
It can be seen that changing the C1D to an R1D block slightly improves the median results, from a median F1 = 0.697 to F1 = 0.729. It can be observed that adding the 2D stage (C2D) to the output of the previous models increased their performance significantly, by 10% for each model. The F1 rises to 0.802 with C1D+C2D, and using a ResNet instead of a CNN in the 2D stage (C1D+R2D) further improves performance up to F1 = 0.818. Finally, results are improved even further in the model with ResNet blocks in both stages (R1D+R2D), reaching F1 = 0.838.
After the ablation study we conclude that ResNet blocks effectively improve the generalization capability, in comparison with simple convolutional layers. Fig. 3 presents a detailed analysis of the true positive rate of predictions in this dataset along the interconnection distance. Interestingly, when the 2D stage (green) is added to the 1D stage (blue) the model performance improves for all connections, and especially for distances longer than 200 nt.
This shows that the sincFold 2D stage in fact improves the learning of long-range dependencies. Moreover, a very interesting property of ResNet blocks is that when there are many blocks available, the model is capable of automatically selecting how many of them are really necessary, skipping the unnecessary blocks during training. This reduces the learnable parameters while helping to learn more relevant features.
Using the best-performing sincFold architecture (R1D+R2D), we performed a hyperparameter space search in the Ablation dataset, exploring batch size, learning rate, the use of a learning rate schedule, weights for the loss components ($\lambda_\beta$, $\lambda_1$), architecture of the 1D-ResNet stage (kernel size and dilation, number of filters and number of layers) and the 2D-ResNet stage (kernel size, number of filters, bottleneck size and number of layers). Parameters were explored randomly [57], and the best configuration was selected for the next experiments (Supplementary Material Figure S1).
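A random exploration of this kind can be sketched as below. The hyperparameter names loosely follow the list above, but every candidate value, the dictionary `SEARCH_SPACE` and the `evaluate` callback are hypothetical placeholders, not the ranges or the training routine used for sincFold.

```python
import random

# Hypothetical search space: all values are placeholders for illustration.
SEARCH_SPACE = {
    "batch_size": [4, 8, 16],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "use_scheduler": [False, True],
    "lambda_beta": [0.1, 0.5, 1.0],
    "lambda_1": [0.0, 1e-3, 1e-2],
    "kernel_1d": [3, 5, 7],
    "layers_2d": [2, 4, 8],
}


def sample_configuration(rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials=20, seed=0):
    """Return the sampled configuration with the highest validation score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_configuration(rng)
        score = evaluate(config)  # e.g. validation F1 of a trained model
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

In practice `evaluate` would train a model with the sampled configuration and return its validation score, which is the expensive step that random sampling keeps affordable.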

Performance according to test-train structural distance on random partitions
Fig. 4 shows the comparative results among classical folding methods (RNAfold, RNAstructure, ProbKnot, IPKnot, LinearPartition-V, LinearFold-V, LinearPartition-C and LinearFold-C), the hybrid method MXfold2, DL-based methods (UFold and REDfold) and the proposed sincFold in terms of F1 for 5-fold cross-validation on the RNAstralign dataset. All DL methods were trained and evaluated from scratch with the same dataset partitions on cross-validation. It can be seen that all classical methods have a performance between 0.633 and 0.712 of F1. MXfold2 combines DL and thermodynamic models and achieves better performance (median F1 = 0.907). DL methods show even better scores: UFold reaches a median F1 = 0.966 and REDfold a median F1 = 0.976. The proposed method, sincFold, achieves F1 = 0.986. The variance of our method is very small, and its box does not overlap with those of the other DL methods.
Regarding non-canonical interactions, their prediction is not supported by the classical methods in the comparison, nor by MXfold2 [8] or REDfold [36]. In contrast, sincFold can indeed predict them just by omitting the post-processing step. For this dataset, UFold with the option to include non-canonical base pairs has a global F1 = 0.971, with F1 = 0.972 for canonical and F1 = 0.515 for non-canonical interactions. Under the same conditions, sincFold achieves a global F1 = 0.979, with F1 = 0.981 for canonical and F1 = 0.940 for non-canonical interactions.
Fig. 5 shows the comparative results among classical folding methods, the hybrid method, DL methods and the proposed sincFold, in terms of F1 for the ArchiveII dataset. As in the previous result, classical methods have a median performance below F1 = 0.620. In this case, MXfold2 achieves F1 = 0.738, UFold has a median F1 = 0.855 and REDfold reaches F1 = 0.831. The proposed method sincFold achieves the highest median, F1 = 0.913. It can be seen that in both datasets, our proposed method achieves a significantly better performance than classical methods and state-of-the-art DL methods.
We would like to mention two other DL methods that have recently appeared for RNA secondary structure prediction, AliNA [58] and RiNALMo [59]. We could not include AliNA in previous figures because this model cannot be re-trained and it has a restriction of maximum length of 256 nucleotides. Thus, we compare sincFold with the same restriction: for RNAstralign, AliNA achieved F1 = 0.910 and sincFold F1 = 0.994; and for ArchiveII, AliNA achieved F1 = 0.809 and sincFold F1 = 0.951. In the case of RiNALMo, it is important to notice that both ArchiveII and RNAstralign were used for pre-training this model. Since RiNALMo is a large language model, it is not feasible to retrain it using the strict cross-validation partitions proposed in this work, thus not allowing a fair comparison. Nevertheless, we fine-tuned the pretrained RiNALMo in the train partitions of our k-fold experimental setup, achieving a test F1 = 0.990 in RNAstralign and F1 = 0.949 in ArchiveII. As expected, RiNALMo results are slightly above those obtained by sincFold because full datasets were used in its pre-training. Notably, sincFold can achieve a high performance without using information about the test sequences during any stage of model training.
Figure 6. Detailed F1 performance for each method according to the mean lengths of the sequences, from shorter (left) to longer (right) sequences.
The detailed performance for each method according to the lengths of the test sequences, from shorter (left) to longer sequences (right), is analyzed in Fig. 6. The light blue bars indicate the proportion of each bin of lengths in the dataset. Here it can be seen that for shorter sequences, all methods have average performance above F1 = 0.60, with all DL methods being particularly good at this task (F1 > 0.90). As sequence length increases, the performance of classical methods drops, while DL methods are less affected, maintaining F1 > 0.75 in most cases, with sincFold always being the best for long sequences and achieving F1 = 0.85 for sequences between 300 and 400 nucleotides. At the extreme, for sequences longer than 400 nucleotides, sincFold is still better than other methods despite the few examples available to learn from, achieving a median F1 = 0.74, which is superior to the average performance of classical methods in the shortest sequences.
It is well known that in random partitions there can be sequences with high similarity between training and testing partitions, and thus methods can show overly optimistic results. To gain more insight into the sincFold performance in this regard, we report comparative results by analyzing the sequences according to the secondary structure distance between test and train partitions, as shown in Fig. 7. Instead of just making one single partition with a certain sequence identity level, we have analyzed a full range of structural similarities. The distance between two structures was computed using RNAdistance from the ViennaRNA package [26] as explained in Section 3.3. For each test sequence, the test-train distance was defined as the minimum structural distance between this test sequence and all the sequences in its corresponding training fold. Then, test sequences with similar structural distance were grouped into bins to obtain the x-axis in the figure. Ranges of structural distances are presented from large (very hard) test-train distances (left) down to low (very easy) test-train distances (right). The light blue bars indicate the proportion of test sequences in each bin of structural distances. It can be clearly seen that as the test-train distance diminishes, all methods (including the classical, non-learnable ones) improve performance. This also makes evident that structures on the left side are really harder to predict, even for models that are not trained and are thus agnostic to the test-train structural distances. For the 23 structures in the first two bins of distances all methods have median F1 < 0.50. It can be seen that for distances between 0.40 and 0.25, both classical and DL methods have again a low performance, F1 ∈ (0.30, 0.60). In the middle cases, from 0.25 to 0.20 distance, DL methods are slightly better than classical ones. Finally, for the lowest test-train structural distances (< 0.15), DL methods are clearly better for RNA secondary structure prediction, with sincFold being the best method in all cases, improving on classical methods from a distance of 0.25 and on all other trainable methods from 0.20. These trends can be explained by two facts. First, the abundance of structure samples benefits DL models more than the classical ones because the former have more cases to learn from. The benefits for the classical methods are indirect, since they do not learn but were developed looking at the most abundant or popular structures, which therefore better fit the thermodynamic models. Secondly, given the advantage of structure abundance for data-driven approaches, sincFold is the method that best takes advantage of the ability to learn from more distant samples, regardless of how much is known about the thermodynamics of the molecules. This is evident even when the distance is around 0.25, and thus far from overfitting to the training samples.

Homology-aware validation
For a deeper performance analysis of the methods considering homology between training and testing partitions, in this section we performed experiments with a more rigorous control of homology, instead of using random partitions such as the k-fold results in Section 4.2. Table 1 shows the results of testing models in a nonredundant set of RNA sequences at 80% sequence-identity cutoff (TR0-TS0 partitions). The table reports comparative results on the TS0 test set according to recall, precision, F1, WL and INF metrics. In all cases, the last three metrics are consistent and show that sincFold is the best method to predict RNA structures with low homology to the training set, and there is almost a 10% performance gap with the classical methods.
A deeper and more detailed analysis of sincFold predictions for special types of structures and connections was made regarding motifs, pseudoknots and multiplets. Supplementary Table S2 reports the performance for pseudoknots, stems, hairpin loops, bulges, internal loops, dangling ends and multi-loops. The results clearly show that sincFold outperforms all other methods in most cases. Due to the fact that none of the datasets of this study contain multiplets, we used another partition of bpRNA that
For benchmarking inter-family performance, a family-fold cross-validation in the ArchiveII dataset was performed. That is, one family is left out for testing per cross-validation fold and the rest of the families are used for training as in [50]. This eliminates most of the homology to the training set, providing a hard measure of performance and, thus, allows estimating future performance on novel RNAs that do not belong to any well-known family.
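The family-fold scheme can be sketched as a leave-one-family-out split. The mapping from sequence identifier to family and the function name are assumptions made for the example; they do not reflect the actual data format used in the experiments.

```python
from typing import Dict, Iterator, List, Tuple


def family_folds(family_of: Dict[str, str]) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_family, train_ids, test_ids), one fold per family.

    `family_of` maps a (hypothetical) sequence id to its family name.
    """
    for held_out in sorted(set(family_of.values())):
        test_ids = sorted(s for s, fam in family_of.items() if fam == held_out)
        train_ids = sorted(s for s, fam in family_of.items() if fam != held_out)
        yield held_out, train_ids, test_ids
```

With nine families this produces nine folds, and no sequence of the test family is ever seen during training in its fold.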
Table 2 shows the average inter-family performance comparison with several metrics between sincFold and other classical, hybrid and pure DL methods for RNA secondary structure prediction. All the metrics used consistently indicate that sincFold is the best DL model to predict novel RNA structures of families never seen during training. As seen previously, the performance of the hybrid method is in between that of classical and DL methods. As expected, classical methods obtain the best performance here, since they do not fully comply with the cross-family validation. This is because they use constraints and thermodynamic parameters that have been experimentally determined from the hairpin loops and other important structures that were most frequently found in most of the RNA families in this dataset [60][61][62][63][64]. In terms of F1, the difference between the best DL method (sincFold) and the best classical method (LinearPartition-C) is 0.191, and 0.193 in the case of INF. However, for the WL metric this gap is much lower, at 0.098. This suggests that when measuring the graph structural information of the predictions, DL and classical methods are close in performance for inter-family validation.
Table 3 shows the detailed performance of DL methods for each family in the ArchiveII dataset. Full results for all methods and all measurements for each family can be found in the Supplementary Material, Table S1. The nine RNA families are characterized in terms of number of samples, average length, structural distance and sequential distance to the other families.
It can be seen that when the grp1 family is used as the test set, all methods have a moderate to low performance. The best DL method here achieves F1 = 0.429. This is a family with a very low number of examples, which have a mean sequence length that is longer than most of the other families, with a moderate structural distance to the other families and high sequence distance to the rest of the dataset. In the case of the tmRNA family, all methods have low performance as well, but here both sincFold and UFold achieve the best result. Although this family has more testing examples (and thus the training set is much smaller), its characteristics are similar to those of grp1 regarding structure and sequence distance to the training set, and sincFold achieves similar results. The tRNA family has the shortest sequences and a large number of examples. In this case, while the other DL methods have low performance, sincFold achieves F1 = 0.685, which is very close to the performance of many classical methods (Table S1). The srp and telomerase testing families are the hardest ones. The srp family, which is very different from the rest of the families regarding sequence distance and structural distance, is better predicted by sincFold. The telomerase family has a very low number of samples, which are very different from the rest of the families regarding mean sequence length (those are the longest sequences, almost double the average). In the case of the 16s and 23s families, sincFold also provides the best predictions.
In summary, sincFold shows improved performance in comparison with the other DL methods in the prediction of tmRNA, tRNA, srp, RNAseP and 16s families, that is, five out of nine families. This is further evidence of the improved generalization capability that sincFold provides relative to the state-of-the-art DL methods. In Section 4 of the Supplementary Material, we show sample predictions of the methods in the inter-family cross-validation experiment. Several cases are provided, including a case outside the ArchiveII dataset.

Conclusions
In this work, we presented sincFold, an end-to-end DL model that can accurately predict the secondary structure from an RNA sequence without requiring multi-sequence alignments or any other pre-processing of the input sequences. Local and distant relationships can be learnt effectively using a sequential 1D-2D architecture. Based on ResNet blocks, bottleneck layers and a 1D-to-2D projection, it has proven to be better suited to identify structures that might defy traditional modeling while reducing the effective number of trainable parameters. We show that sincFold outperforms other methods even with moderate structural distances between train and testing sequences. Results also show that sincFold, due to its capability for capturing a wide range of distances in interactions, is significantly better than all other methods for secondary structure prediction in longer ncRNA sequences as well (more than 200 nucleotides). In an inter-family evaluation, sincFold performed better than other state-of-the-art DL approaches, showing that RNA structure predictions can still be improved with trainable methods.

Key Points
• sincFold is an end-to-end DL model that can accurately predict the secondary structure from an RNA sequence.
• Local and distant relationships can be learnt effectively using a sequential 1D-2D architecture based on residual networks.
• sincFold learns internal representations from 1D and converts them to a 2D representation with a tensorial product to learn the long-range interactions in the following layers.
• Experimental setup includes random folds, low homology partitions and inter-family cross-validation.
• sincFold performed better than other state-of-the-art DL approaches in several datasets.
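The change of shape from a 1D to a 2D representation can be illustrated with a toy example. This is a sketch under strong assumptions, not the sincFold implementation: it applies a plain product between per-position feature vectors (summed over channels) directly to the one-hot input, whereas the model applies its projection to learned 1D features. The function names are illustrative.

```python
from typing import List

ALPHABET = "ACGU"


def one_hot(sequence: str) -> List[List[float]]:
    """Return a [4 x L] one-hot encoding of an RNA sequence."""
    return [[1.0 if nt == base else 0.0 for nt in sequence] for base in ALPHABET]


def to_2d(features: List[List[float]]) -> List[List[float]]:
    """Map [C x L] per-position features to an [L x L] matrix.

    Entry (i, j) is the product of the feature vectors at positions i and j,
    summed over the C channels.
    """
    length = len(features[0])
    return [
        [sum(channel[i] * channel[j] for channel in features) for j in range(length)]
        for i in range(length)
    ]
```

For a sequence of length L the input is [4 × L] and the output is [L × L]; with raw one-hot features, entry (i, j) is simply 1 when positions i and j hold the same nucleotide.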

Figure 1. The end-to-end architecture of sincFold. Top: data flow with its shapes and dimensionality in each point of the architecture, from the [4 × L] one-hot encoded RNA sequence at the input to the [L × L] connection matrix at the output. Bottom: neural processing blocks depicted as differentiable layers.
in the same way as the well-known Matthews correlation coefficient. When the prediction reproduces exactly the base interactions of the reference structure, then |FP| = |FN| = 0, |TP| > 0, and thus INF = 1. When the prediction does not reproduce any of the interactions of the reference structure, then INF = 0, since |TP| = 0.
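The two limiting cases above can be checked numerically. The sketch assumes the usual definition of INF as the geometric mean of precision and recall; that formula does not appear in the text above, so it is an assumption here, as is the function name.

```python
from math import sqrt


def inf_score(tp: int, fp: int, fn: int) -> float:
    """INF as the geometric mean of precision and recall (assumed definition)."""
    if tp == 0:
        return 0.0  # no reference interaction is reproduced
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)
```

With |FP| = |FN| = 0 and |TP| > 0 both factors equal 1, so the score is 1; with |TP| = 0 the score is 0.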

Figure 2. Ablation study on each of the components of the sincFold architecture. Each box has the F1 scores from a 5-fold cross-validation on the Ablation dataset.

Figure 3. True positive rate for each interaction distance, comparing the model with only the first stage and both stages. ResNet-based model with only the first stage (R1D) and both stages (R1D+R2D).

Figure 4. Comparative results among classical folding methods, DL-based folding methods and sincFold, for a 5-fold cross-validation on the RNAstralign dataset. Horizontal scale was adjusted to improve visualization.

Figure 5. Comparative results among classical folding methods, DL-based folding methods and sincFold, for cross-validation on the ArchiveII dataset.

Figure 7. Mean F1 scores for each method according to test-train structural distance, from large distances (left) to low distances (right). The bars indicate the proportion of each bin in the dataset.

Table 1. Performance for methods on bpRNA. The TR0 partition is used for training and the TS0 partition for testing.

Table 2. Inter-family performance comparison in ArchiveII, between sincFold and classical, hybrid and DL methods.

Table 3. Inter-family performance detail of the F1 score in ArchiveII for each RNA family in the comparison between sincFold and other DL RNA secondary structure prediction methods.
1 = 0.45, and a multiplets recall s + = 0.20.